Background

Loan default prediction is important because it helps lenders and financial institutions assess the risk of granting loans to borrowers. By predicting the likelihood of a borrower defaulting on a loan, lenders can make informed decisions on whether to approve the loan, how much to lend, and at what interest rate. This helps to reduce the risk of financial losses due to default and can improve the overall stability of the financial system.

library(mosaic)
library(car)
library(DT)
library(pander)
library(readr)
library(tidyverse)
library(ResourceSelection)
default <- read.csv("C:/Users/tybin/OneDrive/Desktop/MATH 325/Statistics-Notebook-master/Data/loan_default.csv", header=TRUE)
#create a number month column
default <- default %>% 
  mutate(
    num_month = case_when(
      issue_month == "Dec" ~ 12,
      issue_month == "Nov" ~ 11,
      issue_month == "Oct" ~ 10,
      issue_month == "Sep" ~ 9,
      issue_month == "Aug" ~ 8,
      issue_month == "Jul" ~ 7,
      issue_month == "Jun" ~ 6,
      issue_month == "May" ~ 5,
      issue_month == "Apr" ~ 4,
      issue_month == "Mar" ~ 3,
      issue_month == "Feb" ~ 2,
      issue_month == "Jan" ~ 1
    ),
    income_loan_ratio = loan_amnt/annual_inc
  ) %>%
  select(-issue_month)
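The twelve-branch `case_when` gets the job done, but R's built-in `month.abb` vector encodes the same mapping; a minimal sketch of the shortcut (assuming `issue_month` holds the standard three-letter abbreviations, as the branches above confirm):

```r
# month.abb is R's built-in vector c("Jan", "Feb", ..., "Dec");
# match() returns each abbreviation's position, i.e. its month number
months <- c("Dec", "Jan", "Jul")
num_month <- match(months, month.abb)
num_month  # 12 1 7
```

With this, the whole `case_when` block could collapse to `num_month = match(issue_month, month.abb)`.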




set.seed(121)
n <- nrow(default)

keep <- sample(1:n, n*0.8)
mytrain <- default[keep, ]
mytest <- default[-keep, ]


This data set was collected from a GitHub repository. In the default column, 1 means the borrower paid off their loan and 0 means they defaulted. There are 39,685 rows in this table.

My Training Data Columns

We split our data into two groups, a training set and a test set. We do this to mimic predicting future values on new, unseen data.

purpose_prices <- mytrain %>%
  group_by(purpose)%>%
  summarise(
    mean = mean(loan_amnt),
    first_quart = quantile(loan_amnt, 0.25),
    third_quart = quantile(loan_amnt,0.75),
    std = sd(loan_amnt),
    upper_band = mean + 2*(std)
  )

mytrain1 <-mytrain %>%
  left_join(purpose_prices, join_by(purpose)) %>%
  mutate(
    amnt_mean_diff = loan_amnt - mean,
    upper_bound_diff = loan_amnt - upper_band,  
    above_box_diff= loan_amnt - third_quart,
    own_home = ifelse(home_ownership == "OWN",1,0),
    debt_consolidation = ifelse(purpose == "debt_consolidation",1,0),
    other = ifelse(purpose == "other",1,0),
    rent = ifelse(home_ownership == "RENT",1,0),
    ca = ifelse(addr_state == "CA",1,0),
    tx = ifelse(addr_state == "TX",1,0),
    other_state = ifelse(addr_state == "Other",1,0),
    ga = ifelse(addr_state == "GA",1,0),
    pa = ifelse(addr_state == "PA",1,0),
    verif = ifelse(verification_status == "Verified",1,0),
    not_verif = ifelse(verification_status == "Not Verified",1,0),
    source_verif = ifelse(verification_status == "Source Verified",1,0)
    
  )
selected <- mytrain1 %>%
  select(inq_last_6mths, delinq_2yrs, upper_bound_diff, revol_bal, income_loan_ratio, default)


datatable(selected, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")

Full Data

Here’s the full training data, not just our selected columns.

datatable(mytrain1, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")

Exploring Different Parameters

I was curious to look at the different purposes for taking out loans, how they are distributed, and the possible uses of these variables.

I want to look at the purpose of the loan, as it might give us some insight into the odds of it getting paid back. My thought process: when the loan is for a small business or an investment, where the money is being used to make more money, there should be a higher probability of the loan being paid back than when someone uses the loan to buy a toy.

# default == 1 means paid off, so summing 1 - default counts defaulted loans
ggplot(mytrain1, aes(x= purpose, y= 1 - default, fill= purpose)) + 
  geom_bar(stat="identity") + 
  theme_minimal()+
  theme(
    axis.line = element_line(color = "black")
  ) + 
  labs(
    title= "Count of Defaults By Purpose",
    x= "Loan Purpose",
    y="Number of Defaults"
  )

We can see in the graph above that debt_consolidation has the highest number of defaulted loans, and other is the next largest.

After looking at these purpose categories I wanted to take a look at the range of prices these different categories of “purpose” have.

ggplot(mytrain1, aes(x=reorder(purpose, desc(loan_amnt)), y= loan_amnt, fill=purpose)) +
  geom_boxplot() + 
  theme_minimal() + 
  labs(
    title= "Loan Amount Distribution",
    x= "Loan Purpose",
    y="Loan Amount",
    fill="Loan Purpose"
  ) + 
  theme(
    axis.line = element_line(color = "black")
  )

Above we can see the loan amount distribution for each category, ordered by amount.

My theory is that when the loan amount sits more than two standard deviations above the mean loan amount for its category, making the purchase unusually expensive for that category, there is a higher chance of the loan not being paid back.

I created this metric with a summary table showing the summary statistics of the purpose groups, adding an upper band equal to the mean + 2*(standard deviation).
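On a toy vector of loan amounts (hypothetical values), the upper band works out like this:

```r
# Hypothetical loan amounts for one purpose group
amounts <- c(5000, 8000, 12000, 6000, 9000)
upper_band <- mean(amounts) + 2 * sd(amounts)  # 8000 + 2 * 2738.6 = 13477.2
# A loan is "unusually expensive" for its category when loan_amnt - upper_band > 0
upper_band
```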

purpose_prices %>%
  select(mean, std,upper_band) %>%
  pander()
mean std upper_band
6814 4115 15044
11758 7061 25879
12705 7472 27649
11602 8212 28026
8201 6181 20564
8100 6520 21139
13684 8686 31056
9873 6198 22270

From there I assigned the values associated with each group by using “left_join”, so that each row carries the appropriate upper band for our data.

After this I subtracted the upper band from the loan amount to measure how far above the upper band’s range the loan amount is.
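On a hypothetical mini data set, the join-then-subtract step looks like this (column names match the ones used above):

```r
library(dplyr)

loans <- data.frame(purpose   = c("car", "car", "medical"),
                    loan_amnt = c(9000, 20000, 4000))
bands <- data.frame(purpose    = c("car", "medical"),
                    upper_band = c(15000, 10000))

joined <- loans %>%
  left_join(bands, by = "purpose") %>%               # each row gets its group's band
  mutate(upper_bound_diff = loan_amnt - upper_band)  # positive means unusually expensive

joined$upper_bound_diff  # -6000 5000 -6000
```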

Giant Pairs Plot

After taking a look at some theories I decided to create a giant pairs plot to take a better look at the data and gain possible insights into the determinants of a defaulted loan.

Here’s a picture of the annotated pairs plot:

After further consideration of the categories in the pairs plot I decided to use the following:

  • inq_last_6mths : The borrower’s number of inquiries by creditors in the last 6 months.

  • delinq_2yrs : The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

  • upper_bound_diff : A column I created measuring how far the loan amount sits above the upper band for its “purpose” category, flagging unusually expensive loans.

  • revol_bal : The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).

  • income_loan_ratio : A column I created giving the loan amount as a fraction of the person’s annual income.

Multiple Logistic Regression Model

The probability of the loan being paid back, based on the columns above, will be calculated by the following logistic regression model.

\[ P(Y_i = 1|\, x_{i1},x_{i2},x_{i3},x_{i4},x_{i5}) = \frac{e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}}}{1+e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}}} = \pi_i \]

where

Variable Value Explanation
\(x_{i1}\) inq_last_6mths The borrower’s number of inquiries by creditors in the last 6 months.
\(x_{i2}\) delinq_2yrs The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
\(x_{i3}\) upper_bound_diff A column I created measuring how far the loan amount sits above the upper band for its “purpose” category.
\(x_{i4}\) revol_bal The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
\(x_{i5}\) income_loan_ratio A column I created giving the loan amount as a fraction of the person’s annual income.
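To see how the model turns these inputs into a probability, here is a sketch that plugs the coefficient estimates from the summary table further below into the logistic function for a hypothetical borrower (2 inquiries, no delinquencies, a loan right at its upper band, a $10,000 revolving balance, and a loan equal to 30% of income):

```r
# Coefficient estimates copied from the model summary (intercept first)
b <- c(2.87, -0.1885, -0.1466, 1.591e-05, -3.647e-06, -3.06)

# Hypothetical borrower
x <- c(inq_last_6mths = 2, delinq_2yrs = 0, upper_bound_diff = 0,
       revol_bal = 10000, income_loan_ratio = 0.3)

eta    <- b[1] + sum(b[-1] * x)  # linear predictor
pi_hat <- plogis(eta)            # exp(eta) / (1 + exp(eta))
round(pi_hat, 3)                 # estimated probability of full payment, about 0.823
```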

In this model, for each previously approved loan \(i\):

If a given \(\beta\) is zero in the above model, then the corresponding \(x\) value provides no insight into the probability of a defaulted loan. Using a significance level of \(\alpha = 0.05\) we will test the below hypotheses about our beta values.


\[ H_0: \beta_1 = 0 \text{ (inq_last_6mths has no effect)} \\ H_a: \beta_1 \neq 0 \text{ (inq_last_6mths has an effect)} \]


\[ H_0: \beta_2 = 0 \text{ (delinq_2yrs has no effect)}\\ H_a: \beta_2 \neq 0 \text{ (delinq_2yrs has an effect)} \]


\[ H_0: \beta_3 = 0 \text{ (upper_bound_diff has no effect)}\\ H_a: \beta_3 \neq 0 \text{ (upper_bound_diff has an effect)} \]


\[ H_0: \beta_4 = 0 \text{ (revol_bal has no effect)}\\ H_a: \beta_4 \neq 0 \text{ (revol_bal has an effect)} \]


\[ H_0: \beta_5 = 0 \text{ (income_loan_ratio has no effect)}\\ H_a: \beta_5 \neq 0 \text{ (income_loan_ratio has an effect)} \]


Visualizing the Model

Below we can see the logistic model split into separate plots, one per parameter, because the predictors sit on very different scales and a single combined plot would be difficult to interpret.

rev_glm <- glm(default ~ inq_last_6mths + delinq_2yrs + upper_bound_diff + revol_bal + income_loan_ratio, data = mytrain1, family = binomial)
library(ggplot2)
#plotting the info 
b <- coef(rev_glm)
# each curve holds the other predictors at zero and uses that variable's own slope
ggplot(mytrain1)+
  geom_point(aes(x=inq_last_6mths, y=default, color="inq_last_6mths"))+
  geom_function(fun=function(x){exp(b[1]+b[2]*x)/(1+exp(b[1]+b[2]*x))}, aes(color="inq_last_6mths")) + 
  geom_point(aes(x=delinq_2yrs, y=default, color="delinq_2yrs"))+
  geom_function(fun=function(x){exp(b[1]+b[3]*x)/(1+exp(b[1]+b[3]*x))}, aes(color="delinq_2yrs")) + 
  theme_bw()

ggplot(mytrain1)+
  geom_point(aes(x=upper_bound_diff, y=default, color="upper_bound_diff"))+
  geom_function(fun=function(x){exp(b[1]+b[4]*x)/(1+exp(b[1]+b[4]*x))}, aes(color="upper_bound_diff")) + 
  theme_bw()

ggplot(mytrain1)+
  geom_point(aes(x=revol_bal, y=default, color="revol_bal"))+
  geom_function(fun=function(x){exp(b[1]+b[5]*x)/(1+exp(b[1]+b[5]*x))}, aes(color="revol_bal")) + 
  theme_bw()

ggplot(mytrain1)+
  geom_point(aes(x=income_loan_ratio, y=default, color="income_loan_ratio"))+
  geom_function(fun=function(x){exp(b[1]+b[6]*x)/(1+exp(b[1]+b[6]*x))}, aes(color="income_loan_ratio")) + 
  theme_bw()

Evaluating the Model

In this section we are going to evaluate our logistic regression model to determine how well our model works with our training data before we continue to test our model.

pcc <- function(glm_model, df, threshold){
  preds <- predict(glm_model, df, type="response")
  callit <- ifelse(preds > threshold, 1, 0)
  confusion <- table(df$default, callit)
  # proportion correctly classified: the diagonal over the whole table
  sum(diag(confusion)) / sum(confusion)
}
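The accuracy arithmetic inside pcc can be checked by hand on a tiny set of hypothetical predictions:

```r
actual         <- c(0, 0, 1, 1, 1)
predicted_prob <- c(0.2, 0.9, 0.95, 0.7, 0.85)

callit <- ifelse(predicted_prob > 0.8, 1, 0)  # 0 1 1 0 1
tab    <- table(actual, callit)
accuracy <- (tab[1,1] + tab[2,2]) / sum(tab)  # same formula as pcc
accuracy  # 3 of 5 calls correct = 0.6
```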

Logistic Regression Summary Table

summary(rev_glm) %>%
  pander()
  Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.87 0.07219 39.75 0
inq_last_6mths -0.1885 0.01395 -13.52 1.228e-41
delinq_2yrs -0.1466 0.03057 -4.796 1.622e-06
upper_bound_diff 1.591e-05 2.682e-06 5.931 3.02e-09
revol_bal -3.647e-06 1.078e-06 -3.383 0.0007165
income_loan_ratio -3.06 0.1538 -19.89 4.539e-88

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 26154 on 31747 degrees of freedom
Residual deviance: 25575 on 31742 degrees of freedom

In the pandered summary table above we can see that every p-value sits below our alpha of 0.05, so we reject each null hypothesis: none of our beta values equal zero, and each feature contributes significantly to the model.

Goodness of Fit Test

We use the Hosmer-Lemeshow test to determine the goodness of fit of our logistic regression model.

hoslem.test(rev_glm$y,rev_glm$fitted, g=10) %>%
  pander()
Hosmer and Lemeshow goodness of fit (GOF) test: rev_glm$y, rev_glm$fitted
Test statistic df P value
6.173 8 0.6279

We would throw out the model if the p-value were below 0.05; here \(p = 0.6279\), showing that there is insufficient evidence to reject the null hypothesis, so we conclude that our model is a good logistic fit for our data.

Training Prediction Table

trainpreds1 <- predict(rev_glm, mytrain1, type="response")
callit <- ifelse(trainpreds1 > 0.8, 1, 0)
table <- table(mytrain1$default, callit)
prob <- (table[1,1] + table[2,2]) / (table[1,1] + table[1,2] + table[2,1] + table[2,2])
table %>% 
  pander()
  0 1
0 915 3652
1 2838 24343
# pcc(rev_glm, mytrain1, 0.8)

When we look at our prediction table with the cutoff set at 0.8 for calling a loan paid back, we correctly predicted \(\text{24343}\) borrowers to pay us back and \(\text{915}\) to default, while we incorrectly predicted \(\text{3652}\) people to pay us back (they defaulted) and \(\text{2838}\) to default (they paid).

PCC Test

pander(prob)

0.7956

When we go through and calculate the percentage of values we guessed right out of all of our guesses, we get the value above: \(\text{0.7956}\), or about \(79.56\%\).
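The number can be reconstructed by hand from the four cells of the confusion table above:

```r
# Cells from the training confusion table: rows = actual 0/1, cols = predicted 0/1
tab <- matrix(c(915, 2838, 3652, 24343), nrow = 2)
accuracy <- sum(diag(tab)) / sum(tab)  # (915 + 24343) / 31748
round(accuracy, 4)  # 0.7956
```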

Prediction

mytest1 <- mytest %>%
  left_join(purpose_prices, join_by(purpose)) %>%
  mutate(
    amnt_mean_diff = loan_amnt - mean,
    upper_bound_diff = loan_amnt - upper_band,  
    above_box_diff= loan_amnt - third_quart,
    own_home = ifelse(home_ownership == "OWN",1,0),
    debt_consolidation = ifelse(purpose == "debt_consolidation",1,0),
    other = ifelse(purpose == "other",1,0),
    rent = ifelse(home_ownership == "RENT",1,0),
    ca = ifelse(addr_state == "CA",1,0),
    tx = ifelse(addr_state == "TX",1,0),
    other_state = ifelse(addr_state == "Other",1,0),
    ga = ifelse(addr_state == "GA",1,0),
    pa = ifelse(addr_state == "PA",1,0),
    verif = ifelse(verification_status == "Verified",1,0),
    not_verif = ifelse(verification_status == "Not Verified",1,0),
    source_verif = ifelse(verification_status == "Source Verified",1,0)
    
    
  )
testpreds <- predict(rev_glm, mytest1, type="response")

callit <- ifelse(testpreds > 0.8, 1, 0) # the 0.8 cutoff can be set to any threshold

table <- table(mytest1$default, callit)
pc <- (table[1,1] + table[2,2])/ (table[1,1]+table[1,2] + table[2,1] + table[2,2])
pander(pc)

0.8017

When we go through the same process on our test data we get the value above, \(\text{0.8017}\): the accuracy of our trained model on data it has never seen, very close to the training accuracy of \(\text{0.7956}\).

Applicability for Creditors

To determine the applicability of this model we can take another look at our beta values that coincide with our different categories.

When looking at the table it is important to remember that we can calculate the odds ratio through \(OR_i = e^{\beta_i}\). This ratio represents the change in odds for a one-unit change in \(x_i\).
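For example, the coefficient for inq_last_6mths converts to its odds ratio like this:

```r
beta1 <- -0.1885     # estimate from the summary table
OR    <- exp(beta1)  # about 0.828
(OR - 1) * 100       # about -17.2: each added inquiry cuts the payback odds by ~17%
```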

temporary <- data.frame(
  "Variable" = c("inq_last_6mths", "delinq_2yrs", "upper_bound_diff", "revol_bal", "income_loan_ratio"),
  "Beta Values" = c(b[2], b[3], b[4], b[5], b[6] ),
  "Odds Ratio" = c(exp(b[2]), exp(b[3]),exp( b[4]),exp( b[5]), exp(b[6]))
)
pander(temporary)
  Variable Beta.Values Odds.Ratio
inq_last_6mths inq_last_6mths -0.1885 0.8282
delinq_2yrs delinq_2yrs -0.1466 0.8636
upper_bound_diff upper_bound_diff 1.591e-05 1
revol_bal revol_bal -3.647e-06 1
income_loan_ratio income_loan_ratio -3.06 0.0469

Variable Interpretation

Credit Inquiries Within 6 Months

Our Credit Inquiries variable gives the number of credit inquiries lenders have made about the borrower in the last 6 months.

We can see that the Odds Ratio sits around \(\text{0.8281937}\), meaning that for every added inquiry within 6 months the odds of being paid back in full are multiplied by \(\text{0.8281937}\), a decrease of about 17%.

Delinquencies Within Two Years

Our Delinquencies variable displays the number of times the borrower had been 30+ days past due on a payment in the past 2 years.

We can see that the Odds Ratio sits around \(\text{0.8636231}\), meaning that for every added delinquency the odds of being paid back in full are multiplied by \(\text{0.8636231}\).

Upper Bound Difference

A column I created measuring how far the loan amount sits above the upper band for its “purpose” category.

We can see that the Odds Ratio sits around \(\text{1.0000159}\), meaning that for every dollar the loan sits above the upper band the odds of being paid back in full are multiplied by \(\text{1.0000159}\), a very slight increase.

Revolving Balance

The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).

We can see that the Odds Ratio sits around \(\text{0.9999964}\), meaning that for every dollar increase in the revolving balance the odds of being paid back in full are multiplied by \(\text{0.9999964}\), a very slight decrease.

Income Loan Ratio

A column I created giving the loan amount as a fraction of the person’s annual income.

We can see that the Odds Ratio sits around \(\text{0.0468979}\), meaning that for every one-unit increase in the ratio (a loan going from 0% to 100% of annual income) the odds of being paid back in full are multiplied by \(\text{0.0468979}\).